Introduction

This vignette focusses on downstream analysis of the output from scanem. After running scanem, it outputs two or three different types of information:

  1. The weights of the convolutional kernels, or ‘motif detectors’. Conceptually, they are similar to PWMs
  2. The alignments of convolutional kernels to a database of motifs (if such a database is available), with significance (q-)scores for each alignment
  3. The weights of the neural network, representing the importance of the motif impacts in each pool

The central part of scanem trains 10 different neural network models, each using a different 90% of the dataset for training and the remaining 10% for testing. Each model contains 300 motif detectors, so there are a total of 3,000 motifs. To ensure that the results are robust, scanem seeks to identify motif clusters that show up repeatedly, therefor we require that a motif cluster should show up in at least half of the models.

The motifs are named as MODELNUMBER_MOTIFNUMBER, e.g. motif 2_3 is motif 3 from model 2. The best performing out of 10 models is named “best” rather than its model number, e.g. motif best_129.

All of the motifs are aligned with Tomtom, the motif comparison tool from the MEME Suite, against a database of known motifs. This generates a tab-separated file containing what motifs aligned to what reference motifs, and the q-values for the alignments. This result can be represented as a bipartite graph where the model motifs and the database motifs represent the two categories of nodes. Then it becomes possible to select reproducible motifs. For a visual overview, see the figure below.

One of the outputs from the model is a HDF5 file containing motif detector and neural network weights for each model. The neural network weights are weights for each motif detector in each pool - the network uses the weights to determine the importance of motifs in the different pools.

Below is a visual overview of the workflow for motif analysis using scanem. The downstream analysis described in this section starts at the third step, where we have obtained weights from the models and the motif alignments from Tomtom.

scanem_output_workflow

Loading the required packages and helper functions

During this workflow, I will use the following packages: (see end of the page for session info):

suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))
suppressMessages(library(igraph))
suppressMessages(library(mixtools))
suppressMessages(library(rhdf5))
suppressMessages(library(SingleCellExperiment))
suppressMessages(library(stringr))
suppressMessages(library(xtable))
suppressMessages(library(ggrepel))
suppressMessages(library(pheatmap))
suppressMessages(library(magrittr))
suppressMessages(library(reshape2))

options(stringsAsFactors = F)

I have added some custom functions to the following file:

source("scanem_helper_functions.R")

Set up paths to files

For the rest of the analysis, I am using paths to the network output (exp_folder), the input dataset (data_path), and the cell type annotation (colData_path). I also need to input the motif database that was used for motif alignment (motif_db_path). I will also create an output directory for the analysis.

# Network output dir
exp_folder <- "~/jh47/maydev_2/output/tm_7organs_pool80_5u5d/"

# Network input datasets
data_path <- "~/jh47/maydev_2/data/20200901_tm_7organs_pool80_5u5d.tsv.gz"
colData_path <- "~/jh47/maydev_2/data/20200901_tm_7organs_pool80_5u5d_colData.tsv"

# Motif alignment db
motif_db_path <- "~/jh47/maydev_2/resources/Mus_musculus.meme"

# Output directory
outdir <- "20200909_test_out"
dir.create(outdir)
## Warning in dir.create(outdir): '20200909_test_out' already exists

Reading the dataset, alignments, and network output

Using the paths defined above, I can use the functions from scanem_helper_functions.R to read the different parts of the outputs from the neural network for downstream analysis. Firstly, I will construct a SingleCellExperiment object using the pooled dataset. Secondly, I will read the Tomtom output and filter it to include only the genes that are expressed in the dataset. Thirdly, I will read the weights matrix and scale it.

For the third step, I will scale the weights in the neural network layer by the maximum absolute motif activation. This is all done using the read_network_hdf5() function.

sce <- scanem_dataset_to_sce(data_path = data_path, 
                      colData_path = colData_path)

tomtom_table <- read_tomtom_output(motif_db_path = motif_db_path,
                                    exp_folder = exp_folder)

tomtom_table <- filter_tomtom_table(tomtom_table = tomtom_table, sce = sce)

weights_matrix <- read_network_hdf5(exp_folder, sce)

Motif alignment graph

I will now use the Tomtom alignments to construct a bipartite graph. I then use igraph::cluster_walktrap() to cluster these alignments, and I use plot_alignment_graph() to generate two files in the outdir: a motif alignment plot with cluster annotation and another without annotated clusters. Although these plots can often be rather messy, they nevertheless provide a good overview of how many different motif clusters are found.

alignment_graph <- create_alignment_graph(tomtom_table)

# Cluster alignment graph
alignment_graph_clusters <- igraph::cluster_walktrap(alignment_graph)
alignment_graph_cluster_groups <- igraph::groups(alignment_graph_clusters)

# Plot two alignment graphs
plot_alignment_graph(outdir, alignment_graph, alignment_graph_clusters)

To get the motif clusters into a dataframe, I use a custom function:

cluster_df <- construct_cluster_df(exp_folder = exp_folder, 
                                   alignment_graph_cluster_groups = 
                                     alignment_graph_cluster_groups)

I also want to construct a matrix where the motif weights are averaged per cell type:

average_weights_matrix <- construct_average_weights_matrix(weights_matrix)

I use the get_motif_stats function to get information about the motifs in this matrix. This will also plot the motif annotation as “good” or “bad”:

motif_stats <- get_motif_stats(cluster_df = cluster_df,
                               average_weights_matrix = average_weights_matrix, 
                               pseudocount = 1e-11)
## number of iterations= 6 
## Some 'dead motifs' found

The motif “impact” is the weight, i.e. how much a motif will affect the expression of cells In many runs you will encounter a situation where a subgroup of motifs has much lower impact than the others. (I suspect this is because of “dead ReLUs” in the network). The previous function has annotated the lower impact motifs as “bad”. (If this function does not split good/bad motifs well for you, it might be worth changing the pseudocounts option of this function to a lower value).

The get_motif_stats() function will also create a preliminary motif cluster annotation which gives an indication of what kind of motifs are included in the cluster. However, this will need to be refined (which we will get to later).

For further analysis, I will only consider motifs from motif clusters that show up in >=50% of the models, that aligned to known motifs, and that are annotated as “good”.

Filter the values so that it only keeps the good motifs:

# Get names of good motifs: the ones that have high impact, are from 
# reproducible motif clusters, and are aligned to known motifs
good_motifs <- rownames(motif_stats)[motif_stats$is_good_motif=="Yes" &
                                       motif_stats$cluster_reprod >= 0.5 &
                                       motif_stats$cluster_annot != "NA"]

weights_matrix <- weights_matrix[good_motifs,]
motif_stats <- motif_stats[good_motifs,]
average_weights_matrix <- average_weights_matrix[good_motifs,]

Plotting the motif weights across cell types

We can plot the average weights of the motifs across the cell types as a heatmap:

# Only use this function if you have 10 or fewer motif clusters at this point.
annotation_colors <- get_annotation_colors(motif_stats)

# The "drop = FALSE" will make sure the annotation stays a dataframe
pheatmap(average_weights_matrix, 
         annotation_row = motif_stats[,c("cluster_annot"), drop = FALSE], 
         show_rownames = FALSE, angle_col = 45, 
         annotation_colors = annotation_colors, 
         cellwidth = 8, 
         cellheight = .5)

Here, each row corresponds to a motif detector and each column corresponds to a cell type.

As you can see, a small subset of the motifs near the top has the largest impact. If you have good eyes, you may be able to tell that it is the Gabpa/Erg/Etv5/... cluster (ETS family motifs). One other thing to note is that many of the motif clusters do not cluster together very well.

If we instead plot Z-transformed motif weights, it looks like this:

# Use custom function to get 
average_weights_matrix_z_scores <- to_z(average_weights_matrix)

pheatmap(average_weights_matrix_z_scores, 
         annotation_row=motif_stats[,c("cluster_annot"), drop=FALSE], 
         show_rownames = FALSE, angle_col = 45, 
         annotation_colors = annotation_colors, 
         cellwidth = 8, 
         cellheight = .5)

You can see from the leftmost column that motifs from the same family cluster much better. Please keep in mind that the relative impacts of motifs in a single cell type can no longer be deduced from this heatmap, as the values have been transformed per row. The advantage of this representation is that it emphasizes the relative weights of a motif across cell types.

Z-scores can, however, inflate the differences between cell types. This is why it is beneficial to have another score that tells you how “cell-type-specific” a motif is based on the weights. Thus, we calculate a “motif cell type entropy” score for each motif. This is defined as the normalised Shannon entropy of a motif weight across cell types. A value of 1 indicates that the weights are the same across all cell types and lower scores indicate a more cell type specific distribution.

Let’s plot the motifs along with the entropy scores. I will also order the motifs for their motif cluster annotation:

pheatmap(average_weights_matrix_z_scores[order(motif_stats$cluster_annot),], 
         annotation_row=motif_stats[,c("cluster_annot", 
                                       "motif_celltype_entropy")],
         show_rownames = FALSE, angle_col = 45, 
         annotation_colors = annotation_colors, 
         cluster_rows = FALSE, 
         cellwidth = 8, 
         cellheight = .5)

Now, the added motif_celltype_entropy annotation in white/green will tell you how cell type specific the weights are (whiter = more specific). As you can see, the E2f1 cluster is the most specific with high weights for neurons, while other clusters, such as Zfp143/..., are less specific.

Refining motif cluster annotations

The current motif cluster annotation is not very specific, but this is a consequence of the fact that most motifs can be bound by multiple TFs making it impossible to tell which one is binding. Nevertheless, if expression information is available we can take advantage of it to identify the most likely candidate. The underlying assumption here is that TFs with a higher (absolute) correlation are more likely to be involved in the regulation. Please note that this is one of the more bespoke steps in this analysis and it is difficult to fully automate.

The following code will produce plots in the outdir that show the expression pattern for the different cluster candidates, including a Spearman correlation score for the different cluster candidate TFs. In addition, it will generate directories per motif cluster that contain the different plots that show how the expression and weight values are related. It will also return a DataFrame that contains the relevant information

cluster_correlations_df <- plot_candidate_tf_information(cluster_df = cluster_df, 
                              weights_matrix = weights_matrix, 
                              motif_stats = motif_stats, 
                              sce = sce, 
                              tomtom_table = tomtom_table)

As an example we consider the cluster with Yy1 and Mbtps2. One of the figures generated by the above function shows the expression levels of the different genes in the family across cell types:

scanem_output_workflow

On the left we can see that Yy1 has a higher correlation to the motif weights (green) and more Tomtom alignments (purple) than Mbtps2, making it the more likely candidate.

Let’s take a closer look at the correlation plots for these two:

Yy1 Mbtps2

Here, the cell types are annotated by color. There is also a plot generated by plot_candidate_tf_information that adds cell type labels, but those plots can get quite busy when there are many cell types present.

We can now update motif cluster annotations using the correlation information. Sometimes it is hard to identify a single transcription factor, in which case I annotate the cluster with the motif family names. Sometimes it can be good to look at individual motifs in the motif cluster and look directly at the alignments in Tomtom; for this, see the all_tomtom/tomtom.html webpage generated by Tomtom.

I do this using the following code (I look at View(cluster_df[cluster_df$cluster_reproducibility >= 0.5,]) to see which number corresponds to which cluster):

cluster_df$cluster_annot[rownames(cluster_df) == "4"] <- "Egr1/Wt1/Klf family motifs"
cluster_df$cluster_annot[rownames(cluster_df) == "7"] <- "ETS family motifs"
cluster_df$cluster_annot[rownames(cluster_df) == "16"] <- "Yy1"
cluster_df$cluster_annot[rownames(cluster_df) == "2"] <- "bHLH family motifs"
cluster_df$cluster_annot[rownames(cluster_df) == "6"] <- "bHLH family motifs 2"
cluster_df$cluster_annot[rownames(cluster_df) == "10"] <- "Zfp143"

# Update the annotation
motif_stats <- get_motif_stats(cluster_df = cluster_df,
                               average_weights_matrix = average_weights_matrix, 
                               pseudocount = 1e-11)
## number of iterations= 75 
## No clear bimodal distribution: likely not a lot of 'dead motifs'

annotation_colors <- get_annotation_colors(motif_stats)

pheatmap(average_weights_matrix_z_scores[order(motif_stats$cluster_annot),], 
         annotation_row=motif_stats[,c("cluster_annot", 
                                       "motif_celltype_entropy")],
         show_rownames = FALSE, angle_col = 45, 
         annotation_colors = annotation_colors, 
         cluster_rows = FALSE, 
         cellwidth = 8, 
         cellheight = .5)

Plotting all motif family correlations

It is helpful to be able to view the correlations for all TFs related to the motif families reported by the network. Using the following function, you can plot the correlations and expression values for the different motif families, while annotating the top x hits per motif family:

plot_motif_family_correlation_overview(cluster_correlations_df, top_n = 5)
## Warning: Removed 121 rows containing missing values (geom_label_repel).

Plotting motif family weights across cell types

It might make more sense to look at the motif weights individually to assess how the weights vary across cell types. For this I use the following function:

plot_motif_family_weights_across_celltypes(weights_matrix = weights_matrix,
                                           sce = sce, 
                                           motif_stats = motif_stats, 
                                           motif_family = "ETS family motifs")

(If this plot returns a blank plot for you, the motif_family you specified is probably not spelled correctly. You can check unique(motif_stats$cluster_annot) to find out which motif families are available)

As you can see here, the ETS family motifs are higher in immune cells (on the right) and lowest in neurons (on the left).

Another example:

plot_motif_family_weights_across_celltypes(weights_matrix = weights_matrix,
                                           sce = sce, 
                                           motif_stats = motif_stats, 
                                           motif_family = "Egr1/Wt1/Klf family motifs")

sessionInfo()

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas/liblapack.so.3
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] reshape2_1.4.4              magrittr_1.5               
##  [3] pheatmap_1.0.12             ggrepel_0.8.2              
##  [5] xtable_1.8-4                stringr_1.4.0              
##  [7] SingleCellExperiment_1.8.0  SummarizedExperiment_1.16.1
##  [9] DelayedArray_0.12.3         BiocParallel_1.20.1        
## [11] matrixStats_0.56.0          Biobase_2.46.0             
## [13] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1        
## [15] IRanges_2.20.2              S4Vectors_0.24.4           
## [17] BiocGenerics_0.32.0         rhdf5_2.30.1               
## [19] mixtools_1.2.0              igraph_1.2.5               
## [21] ggplot2_3.3.1               dplyr_1.0.0                
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4.6           lattice_0.20-40        digest_0.6.25         
##  [4] R6_2.4.1               plyr_1.8.6             evaluate_0.14         
##  [7] pillar_1.4.4           zlibbioc_1.32.0        rlang_0.4.6           
## [10] kernlab_0.9-29         Matrix_1.2-18          rmarkdown_2.2         
## [13] labeling_0.3           splines_3.6.3          RCurl_1.98-1.2        
## [16] munsell_0.5.0          compiler_3.6.3         xfun_0.14             
## [19] pkgconfig_2.0.3        segmented_1.1-0        htmltools_0.4.0       
## [22] tidyselect_1.1.0       gridExtra_2.3          tibble_3.0.1          
## [25] GenomeInfoDbData_1.2.2 viridisLite_0.3.0      crayon_1.3.4          
## [28] withr_2.2.0            MASS_7.3-51.5          bitops_1.0-6          
## [31] grid_3.6.3             gtable_0.3.0           lifecycle_0.2.0       
## [34] scales_1.1.1           stringi_1.4.6          farver_2.0.3          
## [37] XVector_0.26.0         viridis_0.5.1          ellipsis_0.3.1        
## [40] generics_0.0.2         vctrs_0.3.1            Rhdf5lib_1.8.0        
## [43] RColorBrewer_1.1-2     tools_3.6.3            glue_1.4.1            
## [46] purrr_0.3.4            survival_3.1-8         yaml_2.2.1            
## [49] colorspace_1.4-1       knitr_1.28